Metrics to consider

  1. The ratio (total #reviews)/(#different products), i.e. the average number of reviews per product. Ideally we would like to have several reviews per product, thus this should be high.

  2. Similar products should be plausible alternatives for a buyer. E.g. a hair lotion for dry hair would not replace a hair lotion for greasy hair, whereas different shovels could be considered alternative choices. Unfortunately this is hard to extract automatically, so it has to be decided manually.

  3. Data skewness in favor of the ratings 1 and 5. It would be easier to answer our questions when we have many 5's and 1's, thus the combined share of these two ratings should be high.

  4. The dataset should load quickly. In order to have short feedback loops, at least in the beginning, we need to pick small datasets. We can still consider the large datasets, as long as we use only a random sample of them.

  5. Existing bibliography. Datasets which have already been used by others are preferred, since we can get benchmarks, exploratory analysis data and notebook kernels that we can reuse and extend.
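
Point 4 suggests working on a random sample of the larger datasets. A minimal sketch of one way to do that without loading the whole file is reservoir sampling. It assumes each data file is gzipped JSON lines with one review per line; the function name `sample_reviews` and its parameters are hypothetical and not part of the notebook.

```python
import gzip
import json
import random

def sample_reviews(path, sample_size, seed=0):
    """Return up to `sample_size` reviews drawn uniformly at random.

    Assumes `path` is a gzipped file with one JSON review per line.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    with gzip.open(path, 'rt', encoding='utf-8') as lines:
        for index, line in enumerate(lines):
            if index < sample_size:
                # Fill the reservoir with the first `sample_size` lines.
                reservoir.append(line)
            else:
                # Keep the new line with probability sample_size / (index + 1).
                slot = rng.randint(0, index)
                if slot < sample_size:
                    reservoir[slot] = line
    # Parse only the lines that were kept.
    return [json.loads(line) for line in reservoir]
```

The file is read once and only `sample_size` lines are held in memory, which keeps the feedback loop short even for the largest categories.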

Queries to extract the metrics


In [1]:
def average_review_number_per_product(reviews_df, reviews_count):
    # 'asin' is the product identifier attached to each review.
    distinct_products = reviews_df.select('asin').distinct().count()
    
    return reviews_count / float(distinct_products)

In [2]:
def average_reviews_per_reviewer(reviews_df, reviews_count):
    distinct_reviewers = reviews_df.select('reviewerID').distinct().count()
    
    return reviews_count / float(distinct_reviewers)

In [3]:
def percentages_per_rating(reviews_df, reviews_count):
    # Pick the fields by name: the order of row.asDict().values() is not guaranteed.
    rating_counts = (reviews_df
         .groupBy('overall')
         .count()
         .rdd
         .map(lambda row: (row['overall'], row['count']))
         .collect())
    
    return [ (str(int(rating)), rating_count / float(reviews_count))
        for rating, rating_count
        in rating_counts ]

In [4]:
import re
import numpy as np

def evaluate_metrics(reviews_df, filename):
    name = (re
      .search(r'^reviews_(.+)_5\.json(\.gz)?$', filename)
      .group(1)
      .replace('_', ' '))
    
    print(name)
    
    reviews_count = reviews_df.count()
    
    return dict(
        [ ('dataset_name', name), 
          ('number_of_reviews', reviews_count), 
          ('reviews_per_product', average_review_number_per_product(reviews_df, reviews_count)),
          ('reviews_per_reviewer', average_reviews_per_reviewer(reviews_df, reviews_count))] 
        + percentages_per_rating(reviews_df, reviews_count))
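
The name extraction in `evaluate_metrics` can be tried in isolation. The sketch below uses a slightly stricter variant of the pattern (an optional `.gz` suffix and an end anchor) and returns `None` for non-matching names instead of raising. The helper `dataset_name` is hypothetical, and the file name is only an example that follows the naming scheme the pattern implies.

```python
import re

# Matches e.g. reviews_<Category>_5.json or reviews_<Category>_5.json.gz
FILENAME_PATTERN = re.compile(r'^reviews_(.+)_5\.json(\.gz)?$')

def dataset_name(filename):
    """Return the human-readable category name, or None if the name does not match."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return None
    # Underscores in the category part become spaces.
    return match.group(1).replace('_', ' ')

print(dataset_name('reviews_Cell_Phones_and_Accessories_5.json.gz'))
# prints: Cell Phones and Accessories
```

Returning `None` makes it possible to skip stray files in the data directory rather than failing on the first one that does not match.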

Extract the metrics from all the data files of a given directory into a pandas dataframe


In [5]:
import os
import pandas as pd

def extract_metrics_from_directory(data_directory):
    return (pd
        .DataFrame
        .from_dict(
            [ evaluate_metrics(
                    (spark
                         .read
                         .json(os.path.join(data_directory, filename))), 
                    filename)
                for filename in sorted(os.listdir(data_directory)) ])
        .set_index('dataset_name'))

metrics = extract_metrics_from_directory('./data/raw_data')
metrics.to_csv('./metadata/initial-data-evaluation-metrics.csv')


Amazon Instant Video
Apps for Android
Automotive
Baby
Beauty
Cell Phones and Accessories
Clothing Shoes and Jewelry
Digital Music
Grocery and Gourmet Food
Health and Personal Care
Home and Kitchen
Kindle Store
Office Products
Patio Lawn and Garden
Pet Supplies
Sports and Outdoors
Tools and Home Improvement
Toys and Games
Video Games

In [9]:
metrics.sort_values(['number_of_reviews'], ascending=False)


Out[9]:
dataset_name                        1         2         3         4         5  number_of_reviews  reviews_per_product  reviews_per_reviewer
Kindle Store                 0.023425  0.034734  0.097896  0.258506  0.585440             982619            15.865583             14.403046
Apps for Android             0.104541  0.058949  0.113052  0.209952  0.513505             752937            57.001817              8.627574
Home and Kitchen             0.049133  0.044071  0.081676  0.191248  0.633872             551682            19.537557              8.293600
Health and Personal Care     0.047772  0.048372  0.096011  0.196815  0.611029             346355            18.687547              8.970836
Sports and Outdoors          0.030523  0.034434  0.081228  0.218700  0.635115             296337            16.142997              8.324541
Clothing Shoes and Jewelry   0.040161  0.055487  0.109177  0.209407  0.585768             278677            12.099032              7.075355
Video Games                  0.064082  0.058948  0.121991  0.236448  0.518531             231780            21.718516              9.537094
Beauty                       0.053027  0.057712  0.112079  0.200205  0.576977             198502            16.403768              8.876358
Cell Phones and Accessories  0.068294  0.056902  0.110261  0.205684  0.558859             194439            18.644069              6.974389
Toys and Games               0.028085  0.037578  0.097597  0.223423  0.613316             167597            14.055434              8.633680
Baby                         0.048628  0.057173  0.107313  0.205228  0.581658             160792            22.807376              8.269067
Pet Supplies                 0.055425  0.056432  0.100947  0.177368  0.609829             157836            18.547121              7.949033
Grocery and Gourmet Food     0.038207  0.052342  0.115792  0.215518  0.578140             151254            17.359578             10.302704
Tools and Home Improvement   0.038245  0.036899  0.080081  0.210714  0.634061             134476            13.161985              8.082462
Digital Music                0.043134  0.046518  0.104921  0.255556  0.549872              64706            18.135090             11.677676
Office Products              0.021217  0.032408  0.095009  0.281929  0.569436              53258            22.007438             10.857900
Amazon Instant Video         0.046275  0.050773  0.112778  0.227496  0.562678              37126            22.033234              7.237037
Automotive                   0.026474  0.029600  0.069848  0.193767  0.680311              20473            11.156948              6.992145
Patio Lawn and Garden        0.039105  0.050708  0.125000  0.254973  0.530214              13272            13.796258              7.871886
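
To compare datasets on the skewness metric, the shares of ratings 1 and 5 can be added up and ranked. The sketch below does this for a few rows copied from the table above. Summing the two shares is only one possible reading of that metric, and the names `rating_shares` and `extreme_share` are hypothetical.

```python
# (share of rating 1, share of rating 5), copied from the table for a few datasets.
rating_shares = {
    'Automotive': (0.026474, 0.680311),
    'Apps for Android': (0.104541, 0.513505),
    'Kindle Store': (0.023425, 0.585440),
    'Patio Lawn and Garden': (0.039105, 0.530214),
}

def extreme_share(one_star, five_star):
    """Combined share of the two extreme ratings."""
    return one_star + five_star

# Datasets ordered from most to least polarized.
ranking = sorted(
    rating_shares,
    key=lambda name: extreme_share(*rating_shares[name]),
    reverse=True)

for name in ranking:
    print('%-22s %.6f' % (name, extreme_share(*rating_shares[name])))
```

In this subset the 5-star share dominates the sum, so a dataset such as Apps for Android, which has the largest share of 1's, may still be more useful when negative reviews matter.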